Tidyverse

tidyverse.org defines the tidyverse as:

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

library(tidyverse)
library(dplyr)
library(scater)

We will use single-cell RNA sequencing data from 6826 stem cells from chronic myelomonocytic leukaemia (CMML) patients and healthy controls, generated with the droplet-based, ultra-high-throughput 10x platform. The study that produced this dataset found substantial inter- and intra-patient heterogeneity, with CMML stem cells displaying distinctive transcriptional programs. Compared with normal controls, CMML stem cells exhibited transcriptomes characterized by increased expression of myeloid-lineage and cell cycle genes, and lower expression of genes selectively expressed by normal haematopoietic stem cells.

sce <- readRDS('sce.rds')
sce
class: SingleCellExperiment 
dim: 12695 6826 
metadata(0):
assays(3): counts logcounts norm_exprs
rownames(12695): FO538757.2 AP006222.2 ... AC004556.1 AC240274.1
rowData names(12): id symbol ... total_counts log10_total_counts
colnames(6826): AAACCTGCACCGATAT-1 AAACGGGCACGACTCG-1 ... TTTGGTTTCATCTGCC-11 TTTGTCAGTAGGAGTC-11
colData names(59): barcode Sample ... sizeFactor cellType
reducedDimNames(1): tSNE
altExpNames(0):

Tibble

  • Tibbles are data frames that tweak some older behaviours of base R data frames.
  • Compared with data.frame(), tibble() does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!)
  • It never changes the names of variables, and it never creates row names.
  • A tibble can have column names that are not valid R variable names, aka non-syntactic names.
tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb

Pipe %>%

The pipe %>% passes the output of one step to the next step as its first argument.

# tbl_df() is deprecated since dplyr 1.0.0; as_tibble() is its replacement
as_tibble(as.data.frame(colData(sce)))
names(colData(sce))
 [1] "barcode"                                        "Sample"                                        
 [3] "total_features"                                 "log10_total_features"                          
 [5] "pct_counts_top_50_features"                     "pct_counts_top_100_features"                   
 [7] "pct_counts_top_200_features"                    "pct_counts_top_500_features"                   
 [9] "total_features_endogenous"                      "log10_total_features_endogenous"               
[11] "pct_counts_top_50_features_endogenous"          "pct_counts_top_100_features_endogenous"        
[13] "pct_counts_top_200_features_endogenous"         "pct_counts_top_500_features_endogenous"        
[15] "total_features_feature_control"                 "log10_total_features_feature_control"          
[17] "total_features_Mt"                              "log10_total_features_Mt"                       
[19] "is_cell_control"                                "total_features_by_counts"                      
[21] "log10_total_features_by_counts"                 "total_counts"                                  
[23] "log10_total_counts"                             "pct_counts_in_top_50_features"                 
[25] "pct_counts_in_top_100_features"                 "pct_counts_in_top_200_features"                
[27] "pct_counts_in_top_500_features"                 "total_features_by_counts_endogenous"           
[29] "log10_total_features_by_counts_endogenous"      "total_counts_endogenous"                       
[31] "log10_total_counts_endogenous"                  "pct_counts_endogenous"                         
[33] "pct_counts_in_top_50_features_endogenous"       "pct_counts_in_top_100_features_endogenous"     
[35] "pct_counts_in_top_200_features_endogenous"      "pct_counts_in_top_500_features_endogenous"     
[37] "total_features_by_counts_feature_control"       "log10_total_features_by_counts_feature_control"
[39] "total_counts_feature_control"                   "log10_total_counts_feature_control"            
[41] "pct_counts_feature_control"                     "pct_counts_in_top_50_features_feature_control" 
[43] "pct_counts_in_top_100_features_feature_control" "pct_counts_in_top_200_features_feature_control"
[45] "pct_counts_in_top_500_features_feature_control" "total_features_by_counts_Mt"                   
[47] "log10_total_features_by_counts_Mt"              "total_counts_Mt"                               
[49] "log10_total_counts_Mt"                          "pct_counts_Mt"                                 
[51] "pct_counts_in_top_50_features_Mt"               "pct_counts_in_top_100_features_Mt"             
[53] "pct_counts_in_top_200_features_Mt"              "pct_counts_in_top_500_features_Mt"             
[55] "CellCycle"                                      "Cluster"                                       
[57] "Phase"                                          "sizeFactor"                                    
[59] "cellType"                                      
as_tibble(as.data.frame(colData(sce))) %>%
  group_by(Sample) %>%
  summarise(
    total.features = mean(total_features),
    total.counts = mean(total_counts)
  )
`summarise()` ungrouping output (override with `.groups` argument)

dplyr - Functions as verbs.

The most useful

  • select(): select columns
  • mutate(): create new variables, change existing
  • filter(): subset your data by some criterion
  • summarize(): summarize your data in some way
  • group_by(): group your data by a variable
  • slice(): grab specific rows (observations) by position

Some others

  • count(): count your data
  • arrange(): arrange your data by a column or variable
  • distinct(): gather all distinct values of a variable
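  • n_distinct(): count how many distinct values you have (often used inside summarize())
  • n(): count how many observations you have in a group (only works inside verbs such as summarize())
  • sample_n(): grab a random sample of N rows from your data (superseded by slice_sample() since dplyr 1.0.0)
  • ungroup(): remove the grouping from grouped data
  • top_n(): get the top N entries from a data frame (superseded by slice_max() since dplyr 1.0.0)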

To make things easier, we copy the cell metadata of our SingleCellExperiment object sce into a tibble d:

d <- as_tibble(as.data.frame(colData(sce)))

Select : To select columns

select(d, Sample, Cluster, cellType)
d %>% 
  select(Sample, Cluster, cellType)

Filter : To select rows

d %>% 
  filter(cellType == "HSC")
d %>% 
  select(barcode, Sample, total_features, cellType, Cluster) %>%
  filter(Sample == "BC572")
d %>% 
  filter(cellType == "Erythrocytes", pct_counts_Mt > 1.5) %>% 
  select(barcode, Sample, pct_counts_Mt, cellType, Cluster)

Mutate:

To create new variables in the data table:

d_exp <- d
d_exp <- cbind(d_exp, t(logcounts(sce)[c('KLF4','RUNX1','EGR1'),]))
d_exp
d_exp %>% 
  mutate(Klf4Diff = abs(KLF4 - RUNX1)) %>%
  select(barcode, Sample, cellType, Klf4Diff)

Arrange:

To order the data by a particular variable:

d_exp %>% 
  mutate(Klf4Diff = abs(KLF4 - RUNX1)) %>% 
  arrange(desc(Klf4Diff)) %>% 
  slice(1)

Slice:

To slice your data by rows:

# The 5 cells with the largest absolute KLF4 - RUNX1 difference
d_exp %>% 
  mutate(Klf4Diff = abs(KLF4 - RUNX1)) %>% 
  arrange(desc(Klf4Diff)) %>% 
  slice(1:5)  # slice_max(Klf4Diff, n = 5) would also do the trick
# The 5 cells with the lowest (most negative) KLF4 - RUNX1 difference
d_exp %>% 
  mutate(Klf4Diff = (KLF4 - RUNX1)) %>% 
  select(barcode, Sample, cellType, Klf4Diff) %>%
  slice_min(Klf4Diff, n = 5)  # slice_min() orders the rows itself, so no arrange() is needed

Group by + summarize : forget about loops

First: group by one or more variables. Second: summarize the data with new statistics. Summarize turns many rows into one (per group).

Examples of summary functions:

  • min(x) - minimum value of vector x.
  • max(x) - maximum value of vector x.
  • mean(x) - mean value of vector x.
  • median(x) - median value of vector x.
  • quantile(x, p) - pth quantile of vector x.
  • sd(x) - standard deviation of vector x.
  • var(x) - variance of vector x.
  • IQR(x) - Inter Quartile Range (IQR) of vector x.
  • diff(range(x)) - total range of vector x.
d %>% 
  group_by(cellType) %>% 
  summarise(mean_total_counts = mean(total_counts, na.rm = TRUE), sd_total_counts = sd(total_counts), 
     mean_pct_Mt_count = mean(pct_counts_Mt), count = n()) %>% 
  #ungroup() %>% 
  slice_max(., n=20, order_by = mean_total_counts)  # summarise() already dropped the grouping here, so ungroup() is not needed
`summarise()` ungrouping output (override with `.groups` argument)

Note: mutate() either changes an existing column or adds a new one, keeping every row. summarise() calculates a single value per group. In the mutate() example above, a new column (Klf4Diff) is added; in the summarise() example, each cellType is collapsed into one row.

d %>% 
count(Sample, cellType)

Plotting in R using ggplot2

ggplot2 is a powerful and flexible R package, created by Hadley Wickham, for producing elegant graphics piece by piece (Wickham et al. 2017).

The gg in ggplot2 stands for Grammar of Graphics, a concept that describes plots by using a “grammar”. According to this concept, a plot can be divided into different fundamental parts:

Plot = Data + Aesthetics + Geometry

  1. Data: a data frame
  2. Aesthetics: indicate the x and y variables. They can also be used to control the color, size and shape of points, etc.
  3. Geometry: the type of graphic (scatter plot, histogram, box plot, line plot, ...)
  4. Additional layers for customization: title, labels, axis, etc.

First plotting

The main function in the ggplot2 package is ggplot(), which can be used to initialize the plotting system with data and x/y variables.

For example, the following R code initializes a ggplot with the KLF4 and RUNX1 expression values in d_exp, and then adds a layer (geom_point()) to create a scatter plot of x = KLF4 by y = RUNX1:

  1. Data: d_exp
  2. Aesthetics: aes(x=KLF4, y=RUNX1)
  3. Geometry: geom_point()
ggplot(d_exp, aes(x=KLF4, y=RUNX1))

ggplot(d_exp, aes(x=KLF4, y=RUNX1)) + geom_point()

ggplot(d_exp, aes(x=KLF4, y=RUNX1)) + geom_point(size = 1.2, color = "steelblue", shape = 21)

It’s also possible to control the color and shape of points by a grouping variable (here, Sample). In the code below, we map the point color to Sample.

Note that a ggplot can be stored in a variable, say p, and printed later.

# Control points color by groups
ggplot(d_exp, aes(x=KLF4, y=RUNX1))+
  geom_point(aes(color = Sample))


# Change the default color manually.
# Use the scale_color_manual() function
p <- ggplot(d_exp, aes(x=KLF4, y=RUNX1))+
  geom_point(aes(color = Sample))+
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97"))
print(p)

GGPlot theme

Note that the default theme of ggplot2 is theme_gray() (or theme_grey()), which has a grey background and white grid lines. More themes are available for professional presentations or publications. These include theme_bw(), theme_classic() and theme_minimal().

To change the theme of a given ggplot (p), use this: p + theme_classic().

p <- ggplot(d_exp, aes(x=KLF4, y=RUNX1))+
  geom_point(aes(color = Sample))+
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97"))
p <- p + theme_classic()
print(p)

df <- reducedDim(sce)
head(df)
                       [,1]     [,2]
AAACCTGCACCGATAT-1 31.75737 41.45222
AAACGGGCACGACTCG-1 39.95130 44.24846
AAAGCAATCCTAAGTG-1 35.30692 40.31110
AAAGTAGGTGATGATA-1 37.10719 43.00212
AAAGTAGTCTCGCTTG-1 40.69935 44.52304
AACACGTGTTGGTAAA-1 38.65847 31.53103

Adding layers to ggplot, Lines (Prediction Line)

A plot constructed with ggplot can have more than one geom. In that case the mappings established in the ggplot() call are plot defaults that can be added to or overridden. Our plot could use a regression line:

d_exp$pred.SC <- predict(lm(RUNX1 ~ KLF4, data = d_exp))

ggplot(d_exp, aes(x = KLF4, y = RUNX1)) + 
  geom_point(aes(color = Sample)) +
  geom_line(aes(y = pred.SC)) +
  theme_classic()

Title, xlab & ylab

df <- as.data.frame(reducedDim(sce))
df$Sample <- colData(sce)$Sample
p <- ggplot(df, aes(x=V1, y=V2))+
  geom_point(size = 0.4, aes(color = Sample))+
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97")) + 
  ggtitle('t-SNE plot for Samples') + 
  xlab('tSNE-1') + 
  ylab('tSNE-2') + 
  theme_classic()
print(p)

df <- as.data.frame(reducedDim(sce))
df$Sample <- colData(sce)$Sample
p <- ggplot(df, aes(x=V1, y=V2))+
  geom_point(size = 0.4, aes(color = Sample))+
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97")) + 
  ggtitle('t-SNE plot for Samples') + 
  xlab('tSNE-1') + 
  ylab('tSNE-2') + 
  theme_classic() + 
  guides(colour = guide_legend(override.aes = list(size=4)))

p

Histogram

ggplot(d_exp, aes(x=total_counts)) + geom_histogram() + theme_classic() 

Density plot

df <- data.frame(x=log10(sce$total_counts+1), Sample = sce$Sample)
ggplot(df,
       aes(x = x, fill = as.factor(Sample))) + 
       geom_density(alpha = 0.5) +
       labs(x = expression('log'[10]*'(Library Size)'), title = "Total reads density", fill = "Sample") + 
       theme_classic(base_size = 14) + # Setting the base size text for plots
       scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97")) # Need to set fill manual

Facet

df <- data.frame(x=log10(sce$total_counts+1), Sample = sce$Sample)
ggplot(df,
       aes(x = x, fill = as.factor(Sample))) + 
       geom_density(alpha = 0.5) +
       labs(x = expression('log'[10]*'(Library Size)'), title = "Total reads density", fill = "Sample") + 
       theme_classic(base_size = 14) + # Setting the base size text for plots
       scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97")) +  # Need to set fill manual 
  facet_wrap(~Sample)

Statistical Transformations


Some plot types (such as scatterplots) do not require transformations–each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. require statistical transformations:

  • for a smoother, the y values must be transformed into predicted values

d_exp$pred.SC <- predict(lm(RUNX1 ~ KLF4, data = d_exp))

ggplot(d_exp, aes(x = KLF4, y = RUNX1)) + 
  geom_point(aes(color = Sample)) +
  geom_smooth() +
  theme_classic()

d_exp$pred.SC <- predict(lm(RUNX1 ~ KLF4, data = d_exp))

ggplot(d_exp, aes(x = KLF4, y = RUNX1)) + 
  geom_point(aes(color = Sample)) +
  geom_smooth(method = "lm") +
  theme_classic()

---
title: "Tidyverse & ggplot2 - ICD Bootcamp"
output:
  html_notebook:
    theme: united
    toc: yes
    toc_float:
      collapsed: false
      smooth_scroll: true
editor_options: 
  chunk_output_type: inline
---



# Tidyverse
tidyverse.org defines the tidyverse as:

> The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

```{r}
library(tidyverse)
library(dplyr)
library(scater)
```

We will use single-cell RNA sequencing data from 6826 stem cells from chronic myelomonocytic leukaemia (CMML) patients and healthy controls, generated with the droplet-based, ultra-high-throughput 10x platform. The study that produced this dataset found substantial inter- and intra-patient heterogeneity, with CMML stem cells displaying distinctive transcriptional programs. Compared with normal controls, CMML stem cells exhibited transcriptomes characterized by increased expression of myeloid-lineage and cell cycle genes, and lower expression of genes selectively expressed by normal haematopoietic stem cells.

```{r}
sce <- readRDS('sce.rds')
sce
```

## Tibble
- Tibbles are data frames that tweak some older behaviours of base R data frames.
- Compared with `data.frame()`, `tibble()` does much less: it never changes the type of the inputs (e.g. it never converts strings to factors!)
- It never changes the names of variables, and it never creates row names.
- A tibble can have column names that are not valid R variable names, aka *non-syntactic* names.
```{r}
tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb
```
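
To make the contrast with base R concrete, here is a minimal sketch that builds the same two columns with `data.frame()` and with `tibble()`. The column names and the objects `df_base` and `tb_demo` are made up for illustration.

```r
library(tibble)

# Base data.frame() repairs non-syntactic names by default (check.names = TRUE)
df_base <- data.frame(`2000` = "number", `my col` = "space")
names(df_base)   # "X2000" "my.col"

# tibble() keeps the names exactly as written
tb_demo <- tibble(`2000` = "number", `my col` = "space")
names(tb_demo)   # "2000" "my col"

# Non-syntactic names need backticks (or [[ ]]) when you refer to them
tb_demo$`my col`
tb_demo[["2000"]]
```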

## Pipe `%>%`
The pipe `%>%` passes the output of one step to the next step as its first argument.
```{r}
# tbl_df() is deprecated since dplyr 1.0.0; as_tibble() is its replacement
as_tibble(as.data.frame(colData(sce)))
```
```{r}
names(colData(sce))
```


```{r}
as_tibble(as.data.frame(colData(sce))) %>%
  group_by(Sample) %>%
  summarise(
    total.features = mean(total_features),
    total.counts = mean(total_counts)
  )
```
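
As a small illustration of what the pipe does, the sketch below writes the same computation twice: once as nested calls and once as a pipeline. The vector `scores` is a toy input invented for this example.

```r
library(dplyr)

scores <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(scores)), 1)

# The piped version is read left to right:
# x %>% f(y) is the same as f(x, y)
piped <- scores %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)  # TRUE
```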

## `dplyr` - Functions as verbs.

__The most useful__

- `select()`: select columns
- `mutate()`: create new variables, change existing
- `filter()`: subset your data by some criterion
- `summarize()`: summarize your data in some way
- `group_by()`: group your data by a variable
- `slice()`: grab specific rows (observations) by position

__Some others__

- `count()`: count your data
- `arrange()`: arrange your data by a column or variable
- `distinct()`: gather all distinct values of a variable
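- `n_distinct()`: count how many distinct values you have (often used inside `summarize()`)
- `n()`: count how many observations you have in a group (only works inside verbs such as `summarize()`)
- `sample_n()`: grab a random sample of N rows from your data (superseded by `slice_sample()` since dplyr 1.0.0)
- `ungroup()`: remove the grouping from grouped data
- `top_n()`: get the top N entries from a data frame (superseded by `slice_max()` since dplyr 1.0.0)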

To make things easier, we copy the cell metadata of our `SingleCellExperiment` object `sce` into a tibble `d`:

```{r}
d <- as_tibble(as.data.frame(colData(sce)))
```
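
Before going through the main verbs one by one, here is a short sketch of a few of the helper verbs from the lists above. It runs on a toy tibble, `cells`, whose values are invented for illustration and are not taken from `sce`.

```r
library(dplyr)
library(tibble)

cells <- tibble(
  Sample   = c("S1", "S1", "S1", "S2", "S2"),
  cellType = c("HSC", "HSC", "MPP", "HSC", "GMP")
)

distinct(cells, cellType)    # one row per distinct cellType
n_distinct(cells$cellType)   # works on any vector, not only inside summarise()

# n() and n_distinct() inside summarise(): one row per Sample
per_sample <- cells %>%
  group_by(Sample) %>%
  summarise(n_cells = n(), n_types = n_distinct(cellType), .groups = "drop")

# slice_max() is the current replacement for top_n()
slice_max(per_sample, n_cells, n = 1)
```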

### `Select` : To select columns
```{r}
select(d, Sample, Cluster, cellType)
```

```{r}
d %>% 
  select(Sample, Cluster, cellType)
```
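
With 59 metadata columns, typing every name quickly becomes tedious. The sketch below shows a few selection helpers on a hypothetical table `qc` that borrows some column names from the metadata; the values are made up.

```r
library(dplyr)
library(tibble)

# Hypothetical QC table standing in for the cell metadata
qc <- tibble(
  barcode               = c("cell1", "cell2"),
  Sample                = c("S1", "S2"),
  pct_counts_Mt         = c(1.2, 3.4),
  pct_counts_endogenous = c(98.8, 96.6),
  total_counts          = c(1000, 2500)
)

# Keep the barcode plus every column whose name starts with "pct_counts"
pct_cols <- qc %>% select(barcode, starts_with("pct_counts"))

# Drop a column by negating it
no_totals <- qc %>% select(-total_counts)

# Keep only the numeric columns (where() needs dplyr >= 1.0.0)
numeric_cols <- qc %>% select(where(is.numeric))

names(pct_cols)
names(no_totals)
names(numeric_cols)
```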

### `Filter` : To select rows
```{r}
d %>% 
  filter(cellType == "HSC")
```

```{r}
d %>% 
  select(barcode, Sample, total_features, cellType, Cluster) %>%
  filter(Sample == "BC572")

```

```{r}
d %>% 
  filter(cellType == "Erythrocytes", pct_counts_Mt > 1.5) %>% 
  select(barcode, Sample, pct_counts_Mt, cellType, Cluster)
```
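
The example above combines two conditions with a comma, which means AND. The following sketch, on a toy tibble `cells` with invented values, shows AND next to OR (`|`) and set membership (`%in%`).

```r
library(dplyr)
library(tibble)

cells <- tibble(
  barcode       = c("c1", "c2", "c3", "c4"),
  cellType      = c("HSC", "Erythrocytes", "HSC", "Monocytes"),
  pct_counts_Mt = c(0.8, 2.1, 1.9, 0.5)
)

# Comma-separated conditions are combined with AND
both <- cells %>% filter(cellType == "HSC", pct_counts_Mt > 1.5)

# Use | for OR
either <- cells %>% filter(cellType == "HSC" | pct_counts_Mt > 1.5)

# Use %in% to match any of several values
subset_types <- cells %>% filter(cellType %in% c("HSC", "Monocytes"))

both$barcode          # "c3"
either$barcode        # "c1" "c2" "c3"
subset_types$barcode  # "c1" "c3" "c4"
```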

### `Mutate`: 
To create new variables in the data table:
```{r}
d_exp <- d
d_exp <- cbind(d_exp, t(logcounts(sce)[c('KLF4','RUNX1','EGR1'),]))
```

```{r}
d_exp
```

```{r}
d_exp %>% 
  mutate(Klf4Diff = abs(KLF4 - RUNX1)) %>%
  select(barcode, Sample, cellType, Klf4Diff)
```
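
`mutate()` can add several columns in one call, and it can also overwrite an existing one. A minimal sketch on a toy tibble `expr` (the gene names `geneA` and `geneB` and all values are hypothetical):

```r
library(dplyr)
library(tibble)

expr <- tibble(
  barcode = c("c1", "c2", "c3"),
  geneA   = c(2.2, 0.4, 1.0),
  geneB   = c(1.2, 1.4, 1.0)
)

result <- expr %>%
  mutate(
    delta     = geneA - geneB,   # new column
    abs_delta = abs(delta),      # can use a column created earlier in the same call
    geneB     = geneB * 10       # overwrites an existing column
  )

names(result)  # "barcode" "geneA" "geneB" "delta" "abs_delta"
```

Note that `delta` is computed before `geneB` is overwritten, because the expressions are evaluated in the order they are written.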



### `Arrange`: 
To order the data by a particular variable:

```{r}
d_exp %>% 
  mutate(Klf4Diff = abs(KLF4 - RUNX1)) %>% 
  arrange(desc(Klf4Diff)) %>% 
  slice(1)
```
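
`arrange()` also accepts several sorting keys. The sketch below sorts a toy tibble (invented values) by `Sample` first and then by decreasing `total_counts` within each sample.

```r
library(dplyr)
library(tibble)

cells <- tibble(
  barcode      = c("c1", "c2", "c3", "c4"),
  Sample       = c("S2", "S1", "S1", "S2"),
  total_counts = c(500, 1200, 800, 300)
)

# Sample ascending, then total_counts descending within each Sample
sorted <- cells %>% arrange(Sample, desc(total_counts))

sorted$barcode  # "c2" "c3" "c1" "c4"
```
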
### `Slice`: 
To slice your data by rows:

```{r}
# The 5 cells with the largest absolute KLF4 - RUNX1 difference
d_exp %>% 
  mutate(Klf4Diff = abs(KLF4 - RUNX1)) %>% 
  arrange(desc(Klf4Diff)) %>% 
  slice(1:5)  # slice_max(Klf4Diff, n = 5) would also do the trick
```

```{r}
# The 5 cells with the lowest (most negative) KLF4 - RUNX1 difference
d_exp %>% 
  mutate(Klf4Diff = (KLF4 - RUNX1)) %>% 
  select(barcode, Sample, cellType, Klf4Diff) %>%
  slice_min(Klf4Diff, n = 5)  # slice_min() orders the rows itself, so no arrange() is needed
```
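
To see how `slice()`, `slice_max()` and `slice_min()` relate to each other, here is a small sketch on a toy tibble with an invented `score` column.

```r
library(dplyr)
library(tibble)

cells <- tibble(
  barcode = c("c1", "c2", "c3", "c4"),
  score   = c(0.2, 1.5, -0.7, 0.9)
)

# arrange() + slice(): sort first, then take rows by position
top2_a <- cells %>% arrange(desc(score)) %>% slice(1:2)

# slice_max(): the same rows in one step (ties are kept by default, see with_ties)
top2_b <- cells %>% slice_max(score, n = 2)

# slice_min(): the rows with the smallest values
bottom1 <- cells %>% slice_min(score, n = 1)

identical(top2_a$barcode, top2_b$barcode)  # TRUE
bottom1$barcode                            # "c3"
```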


### Group by + summarize : forget about loops

- __First__: group by one or more variables.
- __Second__: summarize the data with new statistics.
- __Summarize__: turns many rows into one (per group).

Examples of summary functions:

- min(x) - minimum value of vector x.
- max(x) - maximum value of vector x.
- mean(x) - mean value of vector x.
- median(x) - median value of vector x.
- quantile(x, p) - pth quantile of vector x.
- sd(x) - standard deviation of vector x.
- var(x) - variance of vector x.
- IQR(x) - Inter Quartile Range (IQR) of vector x.
- diff(range(x)) - total range of vector x.

```{r}
d %>% 
  group_by(cellType) %>% 
  summarise(mean_total_counts = mean(total_counts, na.rm = TRUE), sd_total_counts = sd(total_counts), 
     mean_pct_Mt_count = mean(pct_counts_Mt), count = n()) %>% 
  #ungroup() %>% 
  slice_max(., n=20, order_by = mean_total_counts)  # summarise() already dropped the grouping here, so ungroup() is not needed
```
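
The sketch below applies a few of the summary functions listed above to a toy tibble (invented values) and contrasts a grouped `summarise()` with a grouped `mutate()`: the first returns one row per group, the second keeps every row.

```r
library(dplyr)
library(tibble)

cells <- tibble(
  cellType     = c("HSC", "HSC", "GMP", "GMP", "GMP"),
  total_counts = c(100, 300, 200, 400, 600)
)

# summarise(): one row per group
per_type <- cells %>%
  group_by(cellType) %>%
  summarise(
    mean_counts   = mean(total_counts),
    median_counts = median(total_counts),
    range_counts  = diff(range(total_counts)),
    n_cells       = n(),
    .groups       = "drop"
  )

# mutate(): keeps every row and adds the group statistic as a new column
with_group_mean <- cells %>%
  group_by(cellType) %>%
  mutate(group_mean = mean(total_counts)) %>%
  ungroup()

nrow(per_type)         # 2
nrow(with_group_mean)  # 5
```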

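__Note: `mutate()` either changes an existing column or adds a new one, keeping every row. `summarise()` calculates a single value per group. In the `mutate()` example above, a new column (`Klf4Diff`) is added; in the `summarise()` example, each `cellType` is collapsed into one row.__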

```{r}
d %>% 
count(Sample, cellType)
```
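
`count()` is a shortcut for a common `group_by()` + `summarise()` pattern. A minimal sketch on toy data (invented values) showing that both give the same counts:

```r
library(dplyr)
library(tibble)

cells <- tibble(
  Sample   = c("S1", "S1", "S1", "S2"),
  cellType = c("HSC", "HSC", "GMP", "HSC")
)

# Shortcut
by_count <- cells %>% count(Sample, cellType)

# Long form of the same computation
by_summarise <- cells %>%
  group_by(Sample, cellType) %>%
  summarise(n = n(), .groups = "drop")

identical(by_count$n, by_summarise$n)  # TRUE
```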

# Plotting in R using `ggplot2`

`ggplot2` is a powerful and flexible R package, created by Hadley Wickham, for producing elegant graphics piece by piece (Wickham et al. 2017).

The `gg` in `ggplot2` stands for Grammar of Graphics, a concept that describes plots by using a “grammar”. According to this concept, a plot can be divided into different fundamental parts:

> Plot = Data + Aesthetics + Geometry


1. __Data:__ a data frame
2. __Aesthetics:__ indicate the x and y variables. They can also be used to control the color, size and shape of points, etc.
3. __Geometry:__ the type of graphic (scatter plot, histogram, box plot, line plot, ...)
4. additional layers for customization — title, labels, axis, etc.
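
As a rough sketch of how the four parts map onto code, the example below uses the built-in `mtcars` data set as a stand-in, so that it runs without the `sce` object. The object name `p_demo` and the title are arbitrary.

```r
library(ggplot2)

p_demo <- ggplot(data = mtcars,                      # 1. Data: a data frame
                 mapping = aes(x = wt, y = mpg)) +   # 2. Aesthetics: columns mapped to x and y
  geom_point() +                                     # 3. Geometry: draw one point per row
  labs(title = "Fuel economy vs. weight")            # 4. Additional layer: a title

p_demo
```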


## First plotting
The main function in the `ggplot2` package is `ggplot()`, which can be used to initialize the plotting system with __data__ and __x/y__ variables.

For example, the following R code initializes a `ggplot` with the `KLF4` and `RUNX1` expression values in `d_exp`, and then adds a layer (`geom_point()`) to create a scatter plot of x = KLF4 by y = RUNX1:

1. __Data:__ `d_exp`
2. __Aesthetics:__ `aes(x=KLF4, y=RUNX1)`
3. __Geometry:__ `geom_point()`

```{r}
ggplot(d_exp, aes(x=KLF4, y=RUNX1))
```

```{r}
ggplot(d_exp, aes(x=KLF4, y=RUNX1)) + geom_point()
```

```{r}
ggplot(d_exp, aes(x=KLF4, y=RUNX1)) + geom_point(size = 1.2, color = "steelblue", shape = 21)
```

It’s also possible to control the color and shape of points by a grouping variable (here, `Sample`). In the code below, we map the point `color` to `Sample`.

Note that a `ggplot` can be stored in a variable, say `p`, and printed later.

```{r}
# Control points color by groups
ggplot(d_exp, aes(x=KLF4, y=RUNX1))+
  geom_point(aes(color = Sample))

# Change the default color manually.
# Use the scale_color_manual() function
p <- ggplot(d_exp, aes(x=KLF4, y=RUNX1))+
  geom_point(aes(color = Sample))+
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97"))
print(p)
```
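
An unnamed colour vector is matched to the groups by position, so the colours shift if the set of groups changes. One possible alternative, sketched here on toy data with hypothetical group labels `A`, `B` and `C`, is to pass a named vector so that each group keeps its colour.

```r
library(ggplot2)

toy <- data.frame(
  x     = c(1, 2, 3, 4),
  y     = c(2, 1, 4, 3),
  group = c("A", "B", "A", "C")
)

# Named vector: each group is tied to a colour regardless of the level order
group_colors <- c(A = "#00AFBB", B = "#E7B800", C = "#FC4E07")

p_colors <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = group), size = 3) +
  scale_color_manual(values = group_colors)

p_colors
```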

## GGPlot theme

Note that the default theme of `ggplot2` is `theme_gray()` (or `theme_grey()`), which has a grey background and white grid lines. More themes are available for professional presentations or publications. These include `theme_bw()`, `theme_classic()` and `theme_minimal()`.

To change the theme of a given ggplot (p), use this: `p + theme_classic()`. 

```{r}
p <- ggplot(d_exp, aes(x=KLF4, y=RUNX1))+
  geom_point(aes(color = Sample))+
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97"))
p <- p + theme_classic()
print(p)

```
```{r}
df <- reducedDim(sce)
head(df)
```

## Adding layers to ggplot, Lines (Prediction Line)
A plot constructed with ggplot can have more than one geom. In that case the mappings established in the `ggplot()` call are plot defaults that can be added to or overridden. Our plot could use a regression line:
```{r}
d_exp$pred.SC <- predict(lm(RUNX1 ~ KLF4, data = d_exp))

ggplot(d_exp, aes(x = KLF4, y = RUNX1)) + 
  geom_point(aes(color = Sample)) +
  geom_line(aes(y = pred.SC)) +
  theme_classic()
```
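
The same pattern can be tried without the `sce` data. In this sketch on the built-in `mtcars` data set, the mappings in `ggplot()` act as defaults, `geom_point()` adds a colour mapping for its own layer, and `geom_line()` overrides `y` for its own layer. The column name `pred` is arbitrary.

```r
library(ggplot2)

cars_fit <- mtcars
cars_fit$pred <- predict(lm(mpg ~ wt, data = mtcars))  # fitted values of a linear model

p_layers <- ggplot(cars_fit, aes(x = wt, y = mpg)) +   # defaults shared by all layers
  geom_point(aes(color = factor(cyl))) +               # adds a colour mapping to this layer only
  geom_line(aes(y = pred))                             # overrides y for this layer only

p_layers
```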


## `Title`, `xlab` & `ylab`
```{r}
df <- as.data.frame(reducedDim(sce))
df$Sample <- colData(sce)$Sample
p <- ggplot(df, aes(x=V1, y=V2))+
  geom_point(size = 0.4, aes(color = Sample))+
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97")) + 
  ggtitle('t-SNE plot for Samples') + 
  xlab('tSNE-1') + 
  ylab('tSNE-2') + 
  theme_classic()

p
```

```{r}
df <- as.data.frame(reducedDim(sce))
df$Sample <- colData(sce)$Sample
p <- ggplot(df, aes(x=V1, y=V2))+
  geom_point(size = 0.4, aes(color = Sample))+
  scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97")) + 
  ggtitle('t-SNE plot for Samples') + 
  xlab('tSNE-1') + 
  ylab('tSNE-2') + 
  theme_classic() + 
  guides(colour = guide_legend(override.aes = list(size=4)))

p
```
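
`ggtitle()`, `xlab()` and `ylab()` can also be written as a single `labs()` call, which additionally sets the legend title. A minimal sketch on the built-in `mtcars` data set (the label texts are just examples):

```r
library(ggplot2)

p_labels <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(
    title = "Fuel economy vs. weight",
    x     = "Weight (1000 lbs)",
    y     = "Miles per gallon",
    color = "Cylinders"   # legend title for the colour aesthetic
  ) +
  theme_classic()

p_labels
```
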
## Histogram

```{r}
ggplot(d_exp, aes(x=total_counts)) + geom_histogram() + theme_classic() 
```
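
Called without arguments, `geom_histogram()` uses 30 bins and prints a message suggesting that you pick a better value. The sketch below, on the built-in `mtcars` data set, shows the two usual ways to control the binning; the numbers are arbitrary choices.

```r
library(ggplot2)

# bins: fix the number of bars
p_bins <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  theme_classic()

# binwidth: fix the width of each bar, in the units of x
p_width <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2.5) +
  theme_classic()

p_bins
p_width
```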


## Density plot
```{r}
df <- data.frame(x=log10(sce$total_counts+1), Sample = sce$Sample)
ggplot(df,
       aes(x = x, fill = as.factor(Sample))) + 
       geom_density(alpha = 0.5) +
       labs(x = expression('log'[10]*'(Library Size)'), title = "Total reads density", fill = "Sample") + 
       theme_classic(base_size = 14) + # Setting the base size text for plots
       scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97")) # Need to set fill manual
```

## Facet

```{r}
df <- data.frame(x=log10(sce$total_counts+1), Sample = sce$Sample)
ggplot(df,
       aes(x = x, fill = as.factor(Sample))) + 
       geom_density(alpha = 0.5) +
       labs(x = expression('log'[10]*'(Library Size)'), title = "Total reads density", fill = "Sample") + 
       theme_classic(base_size = 14) + # Setting the base size text for plots
       scale_fill_manual(values = c("#00AFBB", "#E7B800", "#FC4E07", "#A2AFBB", "#17B8AB", "#3F4E77", "#FCFA27", "#BFAFFB", "#69B89B", "#7F4E97")) +  # Need to set fill manual 
  facet_wrap(~Sample)
```
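
`facet_wrap()` takes a few arguments that control the panel layout. A minimal sketch on the built-in `mtcars` data set, stacking one panel per value of `cyl` and letting each panel use its own y axis:

```r
library(ggplot2)

p_facets <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  facet_wrap(~cyl, ncol = 1, scales = "free_y") +  # one column of panels, independent y axes
  theme_classic()

p_facets
```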

# Statistical Transformations
Some plot types (such as scatterplots) do not require transformations–each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. require statistical transformations:

- for a smoother, the y values must be transformed into predicted values

```{r}
d_exp$pred.SC <- predict(lm(RUNX1 ~ KLF4, data = d_exp))

ggplot(d_exp, aes(x = KLF4, y = RUNX1)) + 
  geom_point(aes(color = Sample)) +
  geom_smooth() +
  theme_classic()
```

```{r}
d_exp$pred.SC <- predict(lm(RUNX1 ~ KLF4, data = d_exp))

ggplot(d_exp, aes(x = KLF4, y = RUNX1)) + 
  geom_point(aes(color = Sample)) +
  geom_smooth(method = "lm") +
  theme_classic()
```
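
To look at the result of a statistical transformation directly, `layer_data()` returns the data a layer actually draws. The sketch below does this for a histogram of the built-in `mtcars` data set: the raw `mpg` values are replaced by one row per bin with a computed `count`.

```r
library(ggplot2)

p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Data of the first layer after the binning transformation
binned <- layer_data(p_hist, 1)

head(binned[, c("xmin", "xmax", "count")])
sum(binned$count)  # 32, the number of rows in mtcars
```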




